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Abstract 

Conserved  Navier-Stokes  dynamics  can  be  exactly  simulated  by  lattice 
gas  methods.  This  work  studies  several  implementation  issues  of  lattice 
gas  automata  on  state-of-art  parallel  computer  systems.  We  present  per¬ 
formance  results  for  the  hexagonal  LGA  lattices  on  CAM-8  mesh,  CM-5 
fat-tree,  and  KSR1  hierarchical  rings  network  topologies.  The  benchmarks 
range  from  200  million  to  500  million  site  updates  per  second. 

I.  Introduction 

We  have  implemented  lattice-gas  automata  (LGA)  on  three  parallel  archi¬ 
tectures:  MIT’s  CAM-8,  Thinking  Machine  Corporations’s  CM-5,  and  Kendall 
Square  Research’s  KSR1.  For  a  16-bit  LGA  our  main  findings  are:  (1)  the 
CAM-8  appears  to  be  much  more  efficient  than  the  other  two  parallel  architec¬ 
tures,  delivering  25  million  sites  update  per  second  per  module;  (2)  the  CM-5 
can  simulate  larger  lattices  due  its  much  larger  memory  sizes  but  can  only  de¬ 
liver  about  1  million  sites  update  per  second  per  node;  and  (3)  the  KSR1  can 
simulate  about  the  same  size  lattice  as  the  CM-5  and  can  deliver  about  2  mil¬ 
lion  sites  update  per  node.  Eight  CAM-8  nodes,  each  clocked  at  25  MHz  with 
16  custom  memory  management  chips  controlling  222  16-bit  sites  (8  Mbytes 
DRAM)  using  double  buffered  look-up  table  (2  Mbytes  SRAM),  are  connected 
by  a  3D  mesh.  512  CM-5  nodes,  each  with  a  32  MHz  SPARC  processor  and 
four  16  MHz  vector  units  with  32  Mbytes,  are  connected  by  a  fat-tree  network. 
128  KSR1  nodes,  each  with  a  64-bit  20  MHz  processor  with  32  Mbytes,  are 
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connected  by  a  hierarchical  ring.  For  the  CM-5  and  KSR1  machines  the  inter¬ 
processing  communication  cost  is  small  relative  to  the  internal  streaming  and 
collision  computations  for  sufficiently  large  lattices.  This  suggests  the  relatively 
low  communication  bandwidth  of  the  two  commercial  machines  impose  no  se¬ 
rious  performance  degeneration.  We  also  find  a  linear  speed-up  with  increasing 
number  of  processors  given  large  lattice  sizes  (e.g.  2kxlk  lattice  or  larger). 

LG  A  is  a  new  approach  to  hydrodynamic  fluid  simulations  [1].  It  has  found 
application  to  multiphase  systems,  reaction-diffusion  systems,  thermohydrody¬ 
namics,  and  flow  through  porous  media.  Particles,  with  mass  to,  propagate 
on  a  spacetime  lattice  with  N  spatial  sites,  unit  cell  size  /,  time  unit  r,  with 
speed  c  =  1/t.  A  particle’s  state  is  completely  specified  at  some  time,  t,  by 
specifying  its  position  on  the  lattice,  x,  and  its  momentum,  p  =  mcea  where 
lattice  vector  are  ea  for  o  =  1,2 ,...,£?.  The  particles  obey  Pauli  exclusion 
since  only  one  particle  can  occupy  a  single  state  at  a  time.  The  total  number  of 
configurations  per  site  is  2B .  The  total  number  of  single  particle  states  available 
in  the  system  is  BN.  The  lattice-gas  cellular  automaton  equation  of  motion  is 
rj'0(x  +  lea,t  +  t)  =  na(x,t)  +  f2ffl(n(x,  t)),  where  the  particle  occupations  and 
collisions  are  denoted  by  na  andQa,  respectively. 

The  spatial  coordinates  of  the  lattice  sites  may  be  expressed  as  follows 
Xj j  —  \  mod  2 where  i  and  j  are  rectilinear  indices  which  spec¬ 

ify  the  data  memory  array  location  used  to  store  the  lattice-gas  site  data. 
Given  a  particle  at  site  it  may  be  shifted  along  vector  r  =  rea  to  a  re¬ 

mote  site  (j! ,j')a  by  the  following  mapping:  (?'  +  r12  [  —  mod2j  mod  2 r,  j  T  r)1  4, 
(i  —  j  —  mod2j  mod  2 r,j  =F  r)2  5,  (i  =F  r,  j) 3  6.  All  of  our  implementations  use 
these  streaming  relations  for  address  computations.  In  these  streaming  rela¬ 
tions,  the  modulus  operator  is  base  2  because  a  two-dimensional  hexagonal 
lattice  embedded  into  a  square  three-dimensional  mesh  is  pleated. 

II.  Implementations  on  the  CAM-8,  CM-5,  and  KSR1 

The  desktop  CAM-81  produced  by  Margolus  and  Toffoli  of  MIT  [2]  directly 
implements  in  hardware  lattice-gas  data  streaming  and  collisions.  The  node-to- 
node  communication  architecture  is  a  cartesian  three-dimensional  mesh.  Each 
site  of  the  lattice  has  a  “pile”  of  bits  (a  multiple  of  16).  Each  bit  of  the  pile, 
or  each  bit  plane,  is  “kicked”  through  the  lattice  in  any  directions.  The  spec¬ 
ification  of  the  x,y,  and  z  components  of  the  kicks  for  each  bit  plane  exactly 
defines  the  lattice.  The  kicks  determine  all  the  global  permutations  of  the  data. 
Local  permutations  of  data  occurs  within  the  bit  piles  are  computed  in  double- 
buffered  SRAM  look-up  tables  (LUTs)  that  store  all  possible  physical  events 
with  a  certain  input  and  output  configurations.  The  width  of  the  CAM-8  LUTs 
are  16-bits.  This  is  a  reasonable  width  satisfying  the  opposing  constraints  of 
model  complexity  versus  memory  size  limitations  for  the  SRAM.  Site  permuta¬ 
tions  of  data  wider  than  16-bits  must  be  implemented  in  several  successive  LUT 

1The  first  8-module  CAM-8  prototype  was  operational  in  the  fall  of  1992. 
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passes.  The  CAM-8  has  built-in  25-bit  event  counters  for  real-time  measure¬ 
ments.  We  have  used  this  feature  to  do  real-time  coarse-grain  block  averaging  of 
the  LGA  number  variables  and  to  compute  the  components  of  the  momentum 
vectors  for  each  block.  The  data  size  of  the  coarse-grained  data  is  sufficiently 
small  that  it  can  be  transferred  back  to  the  front-end  host  for  graphical  display 
as  an  evolving  flow  held  within  an  X- window. 

In  CAMForth,  we  implemented  a  two-speed  hexagonal  LGA  with  a  rest  par¬ 
ticle,  including  gravitational  forcing,  free-slip  and  no-slip  boundaries  which  may 
be  oriented  horizontally,  vertically,  or  inclined  ±60°,  and  heating  and  cooling 
sites  to  model  temperature  controlled  boundary  surfaces.  This  has  been  en¬ 
coded  within  a  single  LUT.  The  ability  of  encoding  such  complex  dynamics 
within  16-bits  is  one  of  the  remarkable  aspects  of  the  LGA  formalism  in  terms 
of  efficient  memory  use  affording  us  the  ability  to  do  hash  updating  from  pre¬ 
stored  collision  tables.  98%  of  the  216  entries  in  the  collision  tables  are  used  in 
this  model. 

The  Thinking  Machines  Corporations ’s  CM-5  is  a  parallel  computer  that  can 
contain  up  to  16384  processing  nodes.2  These  processing  nodes  are  all  connected 
via  a  “fat-tree”  communications  net  providing  general  inter-node  communica¬ 
tion.  A  front-end  host,  a  modified  SUN  workstation,  controls  these  processing 
nodes.  A  SPARC  processor  on  each  node  issues  instructions  to  the  vector  units 
and  performs  most  bookkeeping  tasks  while  the  vector  units  perform  arithmetic 
and  logical  operations  on  the  data.  Each  vector  unit  has  a  peak  rate  of  32  million 
64-bit  ops  (boating  point  or  integer)  for  a  combined  total  of  128  Mops/node. 
Each  node’s  memory  is  divided  into  8  Mbyte  banks,  one  for  each  vector  unit. 
Each  vector  unit  has  its  own  independent  128  Mbyte/sec  path  to  memory  for  a 
combined  memory  bandwidth  of  512  Mbyte/sec  for  each  node.  The  DoD  CM- 
5  at  the  Army  High  Performance  Computing  Research  Center  in  Minneapolis, 
Minnesota  contains  544  nodes  for  a  total  of  16  Gbytes  of  memory  and  64  Gops 
of  peak  processing  speed. 

We  have  implemented  an  LGA  simulator  on  the  CM-5  using  the  high  level 
parallel  C*-language  and  the  low-level  C/CMMD/DPEAC  language.  In  C*,  the 
geometry  of  the  problem  is  specihed  at  the  onset  by  defining  the  data  structure’s 
shape.  This  is  usually  a  D-dimensional  array  with  a  certain  size  in  each  dimen¬ 
sion.  The  shape  definition  defines  all  the  needed  communication  topology  for 
the  compiler.  It  is  possible  to  declare  Boolean  shapes  in  C*.  The  partitioning  of 
the  space  between  processors  is  handled  completely  by  the  C*  compiler.  Given 
a  certain  lattice  size,  for  example  1024  x  2048,  with  have  found  the  performance 
of  the  CM-5  to  vary  linearly  with  the  number  of  processing  nodes.  This  linear 
variation  is  expected  so  long  as  the  lattice  size  is  sufficiently  large.  We  did  sev¬ 
eral  runs  for  lattice  sizes  64  x  128, 128  x  256, . . . ,  8192  x  16384.  For  small  lattice 
sizes,  the  performance  is  very  poor,  on  the  order  of  a  million  site  updates  per 
seconds.  This  is  because  processor  to  processor  communication  bandwidth  lim¬ 
its  the  streaming  rate.  As  the  lattice  size  increases,  the  number  of  sites  interior 

2Currently  the  largest  CM-5  resides  at  Los  Alamos  National  Laboratory  with  1024  nodes. 
Our  work  has  been  done  on  DoD’s  544-node  CM-5. 
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to  the  node  grows  and  the  number  of  sites  on  the  partition  boundary  decreases. 
Consequently,  the  site  update  rate  continuously  improves  with  larger  lattices. 
The  update  rate  asymptotically  approaches  about  25  million  site  updates  per 
second  for  a  256-node  partition.  This  is  equivalent  to  approximately  100,000 
site  updates  per  processing  node.  The  C*  compiler  does  not  appear  to  be  very 
efficient  for  the  LGA  algorithm. 

In  our  C/CMMD/DPEAC  implementation  we  partition  the  lattice  into  equally 
sized  rectangles  with  one  node  (four  vector  units)  assigned  per  rectangle.  The 
streaming  phase  of  the  computation  uses  the  address  of  each  bit’s  destination 
in  a  pre-computed  table.  This  is  done  so  potentially  complex  addressing  cal¬ 
culations  are  performed  only  once  during  initialization.  Before  a  site  can  be 
updated,  a  communication  phase  must  take  place  so  each  site  can  access  all  its 
neighbors.  Communication  must  take  place  across  node  and  vector  unit  bound¬ 
aries.  The  communication  is  done  so  every  site  has  access  to  all  its  neighbor 
values  on  one  vector  unit’s  8  Mbyte  bank  of  memory.  After  the  streaming  op¬ 
eration  is  complete  the  new  site  value  is  run  through  a  LUT.  Each  vector  unit 
has  its  own  copy  of  the  LUT,  so  this  part  of  the  calculation  is  not  time  con¬ 
suming,  yet  it  consumes  are  large  amount  of  memory.  We  have  achieved  update 
rates  of  about  106  Msites/sec  on  a  128-node  partition  of  the  CM-5  for  the  FHP 
gas  model.  We  find  that  the  longer  the  system  is  across  each  node  the  greater 
the  realized  performance.  This  is  due  to  the  long  system  size  across  each  node 
increases  the  fraction  of  sites  in  the  interior  of  each  vector  unit  and  these  sites 
do  not  need  to  communicate  with  sites  on  adjacent  vector  units  or  processing 
nodes. 

The  KSR1  is  a  parallel  computer  based  on  a  shared-memory  model.  Sets  of 
32  processors  are  connected  by  a  ring  network  with  a  communication  bandwidth 
of  lGByte/sec  between  each  set.  Sets  of  rings  of  32  processors  are  in  turn 
connected  in  a  ring  at  the  next  level.  The  processors  are  clocked  at  20MHz.  The 
peak  performance  per  processor  is  40  Mflops.  Referencing  data  not  physically 
located  in  one  processor  causes  a  cache  miss  and  consequently  brings  in  a  cache 
line  of  128  bytes.  The  cost  of  the  cache  miss  depends  on  whether  the  data  is 
located  in  the  same  ring  or  not  and  can  cost  up  to  several  hundred  machine 
cycles.  Thus,  it  is  crucial  to  exploit  the  locality  of  data  in  programming  the 
KSR1. 

We  implemented  an  LGA  simulator  in  KSR  Fortran  -  an  extension  of  Fortran 
with  parallelizing  KSR  directives.  The  hexagonal  lattice  is  embedded  is  a  similar 
fashion  to  our  implementation  on  the  CAM-8  and  CM-5.  A  m  x  n  lattice  is 
partitioned  into  p  stripes  of  shape  n  by  m/p,  each  assigned  to  one  processor. 
The  boundaries  of  the  lattice  are  programmed  separately  and  all  the  processors 
share  the  work.  We  have  benchmarked  the  performance  in  two  ways:  varying  the 
number  of  processors  with  fixed  lattice  size  and  varying  the  lattice  size  for  fixed 
number  of  processors.  The  results  show  that  a)  the  speedup  is  close  to  linear 
as  we  increase  the  number  of  processors  and  b)  the  per  node  site  update  rate 
increases  with  the  lattice  size  but  slowly  decreases  as  the  number  of  processor 
increases.  The  KSR1  can  deliver  about  2  million  sites  update  per  second  per 
node  for  large  lattices. 
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III.  Conclusion 


The  CAM-8  architecture  is  an  elegant,  arguably  the  best,  distillation  of 
lattice  gas  dynamics  that  has  been  realized  in  low-cost  desktop  hardware.  We 
look  forward  to  the  construction  of  a  larger  CAM-8,  with  much  more  than  eight 
modules,  in  the  near  future.  The  CM-5  results  are  based  on  a  beta  version 
of  the  software  and  are  not  necessarily  representative  of  the  full  version  of  the 
software. 
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